Personal Loan Campaign

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

  1. To predict whether a liability customer will buy a personal loan or not.
  2. To identify which variables are most significant.
  3. To determine which segment of customers should be targeted more.

Dataset

The data contains demographic and financial characteristics of the bank's customers.

Loading Libraries

Load data

View the first and last 5 rows of the dataset.

View random sample of data

Understand the shape of the dataset.

Check the data types of the columns for the dataset.

Summary of the dataset.

Data Pre-processing

Explore Experience column further

We will attempt to fix the negative values in the Experience column
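A minimal sketch of one common fix, replacing negative experience values (almost certainly data-entry artifacts) with their absolute value. The DataFrame name `df` and the toy data are assumptions for illustration:

```python
import pandas as pd

# Toy frame standing in for the real dataset (column names assumed)
df = pd.DataFrame({"Age": [25, 30, 45], "Experience": [-1, 5, 20]})

# A negative number of years of experience cannot be real;
# one simple fix is to replace negatives with their absolute value
df["Experience"] = df["Experience"].abs()

print(df["Experience"].min())  # no negative values remain
```

An alternative is to impute negative entries from `Age` (e.g., `Age - 23`), at the cost of a stronger assumption about when careers start.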

Feature Engineering

Mapping ZIPCode to County to reduce the number of unique values.

There are still some ZIP codes that do not map to a California county. Let's replace the resulting NaN values with the string "unknown".
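A sketch of both steps, using a hypothetical prefix-to-county lookup (`zip_to_county` and the toy ZIP codes are illustrative; the real mapping would cover every California ZIP prefix):

```python
import pandas as pd

# Toy data standing in for the real ZIPCode column
df = pd.DataFrame({"ZIPCode": [94112, 90089, 99999]})

# Hypothetical lookup keyed on the first three digits of the ZIP code
zip_to_county = {"941": "San Francisco", "900": "Los Angeles"}

df["County"] = df["ZIPCode"].astype(str).str[:3].map(zip_to_county)

# ZIP codes with no matching county become NaN; label them "unknown"
df["County"] = df["County"].fillna("unknown")

print(df["County"].tolist())
```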

Fixing Datatypes

Let's convert "County", "Family", "Education", "Personal_Loan", "Securities_Account", "CD_Account", "Online", and "CreditCard" to the category dtype.
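The conversion can be sketched with `astype("category")` on a toy frame (a subset of the columns is shown for brevity):

```python
import pandas as pd

# Toy stand-in with a few of the columns being converted
df = pd.DataFrame({
    "County": ["San Francisco", "Los Angeles"],
    "Family": [2, 4],
    "Education": [1, 3],
    "Personal_Loan": [0, 1],
})

cat_cols = ["County", "Family", "Education", "Personal_Loan"]
df[cat_cols] = df[cat_cols].astype("category")

print(df.dtypes)  # all four columns now show dtype "category"
```

Using the category dtype saves memory and makes the intent of these columns explicit to downstream plotting and modeling code.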

At this point, we'll create two dataframes, one for each model.

Let's look at the different levels in the categorical variables.

Univariate Analysis

Numerical Variables

Observations on Age

Observations on Experience

Observations on Income

Observations on CCAvg

Observations on Mortgage

Categorical Variables

Observations on Family

Observations on Education

Observations on Securities_Account

Observations on CD_Account

Observations on Online

Observations on CreditCard

Observations on ZIPCode

Observations on Salary

Bivariate analysis

Categorical Analysis

Family vs Personal_Loan

Personal_Loan vs Education

Personal_Loan vs Securities_Account

Personal_Loan vs CD_Account

Personal_Loan vs Online

Personal_Loan vs CreditCard

Numerical Analysis

Personal_Loan vs Age

Personal_Loan vs Experience

Personal_Loan vs Income

Personal_Loan vs CCAvg

Personal_Loan vs Income vs Education

Logistic Regression Model

Data Preparation

Creating training and test sets.
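Because only about 9% of customers bought the loan, stratifying the split keeps the conversion rate identical in train and test. A sketch with toy data (the variable names `X`, `y`, and the 70/30 split are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: X = predictors, y = Personal_Loan target
X = pd.DataFrame({"Income": range(100), "CCAvg": range(100)})
y = pd.Series([0] * 90 + [1] * 10)  # ~10% positives, close to the dataset

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(y_train.mean(), y_test.mean())  # both splits keep the positive rate
```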

Building the model

Model evaluation criterion

The model can make two kinds of wrong prediction:

  1. Predicting that a customer will purchase the loan when in reality they do not (a false positive).
  2. Predicting that a customer will not purchase the loan when in reality they do (a false negative).

Which case is more important?

How can we reduce this loss, i.e., reduce the number of false negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
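A sketch of such helpers (the function names and the one-row-DataFrame layout are illustrative choices, not the notebook's actual code):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(y_true, y_pred):
    """Return the common classification metrics as a one-row DataFrame."""
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })

def show_confusion_matrix(y_true, y_pred):
    """Confusion matrix as a labelled DataFrame (rows: actual, cols: predicted)."""
    cm = confusion_matrix(y_true, y_pred)
    return pd.DataFrame(cm, index=["Actual 0", "Actual 1"],
                        columns=["Pred 0", "Pred 1"])

# Tiny smoke test
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(model_performance(y_true, y_pred))
print(show_confusion_matrix(y_true, y_pred))
```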

Logistic Regression (with Sklearn library)

Checking model performance on training set

Checking performance on test set

Observations

Logistic Regression (with statsmodels library)

Observations

Additional Information on VIF

Multicollinearity

Removing Age

Dropping Age and Experience

Summary of the model without Age and Experience

Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

Now no feature has p-value greater than 0.05, so we'll consider the features in X_train3 as the final ones and lg4 as final model.

Coefficient interpretations

Converting coefficients to odds
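Logistic regression coefficients are on the log-odds scale; exponentiating them gives odds ratios. A sketch with illustrative (not fitted) coefficient values:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on the log-odds scale, for illustration only
coefs = pd.Series({"const": -12.5, "Income": 0.05, "CD_Account": 3.8})

# exp(coef) = multiplicative change in odds for a one-unit increase
odds = np.exp(coefs)
pct_change = (odds - 1) * 100

print(pd.DataFrame({"Odds": odds, "Change_odds%": pct_change}))
```

Read a row like: with `Income` at 0.05, each extra unit of income multiplies the odds of buying the loan by exp(0.05) ≈ 1.051, i.e., about a 5.1% increase, holding other variables constant.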

Coefficient interpretations

Interpretation for other attributes can be done similarly.

Checking model performance on the training set

ROC-AUC

Model Performance Improvement

Optimal threshold using AUC-ROC curve
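One common way to pick a threshold from the ROC curve is Youden's J statistic, the point maximizing TPR − FPR. A sketch on synthetic data (the logistic model and data here are stand-ins, not the notebook's fitted model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (500, 1))
y = (X[:, 0] + rng.normal(0, 1, 500) > 1.3).astype(int)  # imbalanced target

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)
# Youden's J: threshold where TPR - FPR is largest
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print(round(float(optimal_threshold), 2))
```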

Checking model performance on training set

Let's use Precision-Recall curve and see if we can find a better threshold
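One common choice on the precision-recall curve is the threshold that maximizes the F1 score, i.e., the harmonic mean of precision and recall. A sketch on synthetic data (again a stand-in, not the notebook's model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (500, 1))
y = (X[:, 0] + rng.normal(0, 1, 500) > 1.3).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, proba)
# F1 at each candidate threshold (thresholds has one fewer entry than
# precision/recall, so drop the last point before taking the argmax)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]
print(round(float(best), 2))
```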

Checking model performance on training set

Model Performance Summary

Let's check the performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

Using model with threshold=0.13

Using model with threshold = 0.33

Model performance summary

Let's try using a decision tree model

Decision Tree Model

Data Preparation

Model Building - Approach

  1. Prepare the data.
  2. Partition the data into train and test sets.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Evaluate the model on the test set.

Split Data

Build Decision Tree Model

Only 9% of the samples belong to the positive class, so a model that labels every sample as negative would still achieve about 91% accuracy; accuracy is therefore not a good evaluation metric here.
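The accuracy trap can be demonstrated in a few lines (synthetic labels with roughly the dataset's 9% positive rate):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 9 positives out of 100, mirroring the dataset's class imbalance
y_true = np.array([1] * 9 + [0] * 91)

# A "model" that predicts the majority class for every customer
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.91 -- looks good...
print(recall_score(y_true, y_pred))    # 0.0  -- ...but finds no loan buyers
```

This is why recall, which directly counts how many actual buyers the model catches, is the metric optimized throughout this section.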

Insights:

The tree above is very complex and difficult to interpret.

Reducing Overfitting

Using GridSearch for Hyperparameter tuning of our tree model
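A sketch of the tuning setup on synthetic data (the parameter grid and `scoring="recall"` choice are illustrative; the real notebook's grid may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank's training set
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 20],
    "max_leaf_nodes": [10, 20, None],
}

# Optimize for recall, since missed loan buyers (false negatives) cost the most
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```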

Recall has improved for both train and test set after hyperparameter tuning, and we have got a generalized model.

Visualizing the Decision Tree

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of the pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides the DecisionTreeClassifier.cost_complexity_pruning_path function, which returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
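The steps above can be sketched as follows (synthetic data stands in for the bank's training set; variable names mirror the scikit-learn convention used in the text):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)

# Effective alphas and leaf impurities along the pruning path
tree = DecisionTreeClassifier(random_state=1)
path = tree.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One tree per alpha; larger alpha means more aggressive pruning
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
    for a in ccp_alphas
]

# Drop the last (trivial, single-node) tree
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
node_counts = [c.tree_.node_count for c in clfs]
print(node_counts[0], node_counts[-1])  # node count shrinks as alpha grows
```

The final alpha would then be chosen by computing recall on train and test sets for each tree in `clfs` and picking the value where the two are high and close together.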

Recall is maximized at alpha = 0.004.

Visualizing the Decision Tree

Comparing all the decision tree models

Decision tree model with post pruning has given the best recall score on the test data.

Conclusion

Recommendations

The decision tree with post-pruning is our preferred model; it satisfies the modeling assumptions and can be used for interpretation.

This model reduces false negatives by a greater margin than the logistic regression model.

The retail marketing department should devise campaigns to target: